Travel Package Purchase - Connie Xavier¶

Problem Definition and Information¶

Context¶

A tourism company named "Visit With Us" currently offers five types of packages: Basic, Standard, Deluxe, Super Deluxe, and King. The company observed that 18% of the customers purchased a package last year. However, it was difficult to identify potential customers because customers were contacted at random, without using the available information. The company is now planning to launch a new product, a Wellness Tourism Package. Wellness tourism is travel that allows the traveler to maintain, enhance, or kick-start a healthy lifestyle and support or increase one's sense of well-being. This time, the company wants to harness the available data on existing and potential customers to target the right customers.

As a Data Scientist at "Visit With Us", I have to analyze the customers' data, provide recommendations to the Policy Maker, and build a model that predicts which customers are likely to purchase the newly introduced travel package. The model will make its predictions before a customer is contacted.

Objective¶

To analyze, visualize, and preprocess the data, and determine the best model from different ensemble models (bagging and boosting models along with tuned models) which can predict which customer is more likely to purchase the newly introduced travel package.

Key Questions¶

  1. Is there a good model to predict which customer is more likely to purchase the newly introduced travel package? What does the performance assessment look like for such a model?
  2. What are the key factors influencing whether a customer purchases a travel package?
  3. What would the advice be to grow the business?

Data Information¶

Each record in the database represents a customer's information. A detailed data dictionary can be found below.

Data Dictionary

Customer details:

  • CustomerID: Unique customer ID
  • ProdTaken: Whether the customer has purchased a package or not (0: No, 1: Yes)
  • Age: Age of customer
  • TypeofContact: How customer was contacted (Company Invited or Self Inquiry)
  • CityTier: The tier of the city the customer lives in. City tier depends on a city's development, population, facilities, and living standards; the categories are ordered, i.e., Tier 1 > Tier 2 > Tier 3.
  • Occupation: Occupation of customer
  • Gender: Gender of customer
  • NumberOfPersonVisiting: Total number of persons planning to take the trip with the customer
  • PreferredPropertyStar: Preferred hotel property rating by customer
  • MaritalStatus: Marital status of customer
  • NumberOfTrips: Average number of trips in a year by customer
  • Passport: The customer has a passport or not (0: No, 1: Yes)
  • OwnCar: Whether the customers own a car or not (0: No, 1: Yes)
  • NumberOfChildrenVisiting: Total number of children with age less than 5 planning to take the trip with the customer
  • Designation: Designation of the customer in the current organization
  • MonthlyIncome: Gross monthly income of the customer

Customer Interaction Data:

  • PitchSatisfactionScore: Sales pitch satisfaction score
  • ProductPitched: Product pitched by the salesperson
  • NumberOfFollowups: Total number of follow-ups done by the salesperson after the sales pitch
  • DurationOfPitch: Duration of the pitch by a salesperson to the customer

---------------------------------------------------------------------------------------------------------------¶

Part 1: Overview of Data¶

In [234]:
# import relevant libraries
# this will help in making the Python code more structured automatically (good coding practice)
%load_ext nb_black

# Library to suppress warnings or deprecation notes
import warnings

warnings.filterwarnings("ignore")

# Libraries to help with reading and manipulating data
import numpy as np
import pandas as pd

# Libraries to help with data visualization
import matplotlib.pyplot as plt

%matplotlib inline
import seaborn as sns

# Libraries to split data, impute missing values
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer

# Libraries to import decision tree classifier and different ensemble classifiers
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from xgboost import XGBClassifier
from sklearn.ensemble import StackingClassifier
from sklearn.tree import DecisionTreeClassifier

# Libtune to tune model, get different metric scores
from sklearn import metrics
from sklearn.metrics import (
    confusion_matrix,
    classification_report,
    accuracy_score,
    precision_score,
    recall_score,
    f1_score,
    roc_auc_score,
)
from sklearn.model_selection import GridSearchCV
The nb_black extension is already loaded. To reload it, use:
  %reload_ext nb_black
In [235]:
# load the data
data = pd.read_excel("Tourism.xlsx", "Tourism")
In [236]:
# check a sample of the data to make sure it came in correctly
data.sample(n=10, random_state=101)
Out[236]:
CustomerID ProdTaken Age TypeofContact CityTier DurationOfPitch Occupation Gender NumberOfPersonVisiting NumberOfFollowups ProductPitched PreferredPropertyStar MaritalStatus NumberOfTrips Passport PitchSatisfactionScore OwnCar NumberOfChildrenVisiting Designation MonthlyIncome
803 200803 0 34.0 Company Invited 1 9.0 Salaried Male 2 4.0 Basic 3.0 Divorced 4.0 0 2 1 0.0 Executive 17979.0
589 200589 1 29.0 Self Enquiry 1 6.0 Salaried Female 2 4.0 Basic 5.0 Divorced 2.0 1 2 0 0.0 Executive 17319.0
3736 203736 0 40.0 Company Invited 3 27.0 Salaried Male 3 4.0 Deluxe 3.0 Married 4.0 0 3 1 1.0 Manager 22805.0
3996 203996 0 56.0 Self Enquiry 3 7.0 Salaried Male 4 4.0 Standard 3.0 Married 5.0 0 1 0 3.0 Senior Manager 28917.0
3491 203491 0 34.0 Company Invited 3 14.0 Small Business Male 3 4.0 Deluxe 5.0 Divorced 2.0 0 3 0 1.0 Manager 23051.0
1563 201563 0 46.0 Company Invited 1 6.0 Small Business Male 2 4.0 Standard 5.0 Married 3.0 1 1 1 1.0 Senior Manager 25673.0
2503 202503 0 38.0 Self Enquiry 1 7.0 Salaried Male 3 5.0 Deluxe 3.0 Married 3.0 0 5 1 2.0 Manager 24671.0
160 200160 0 22.0 Self Enquiry 1 25.0 Small Business Male 3 3.0 Basic 3.0 Divorced 2.0 0 2 0 1.0 Executive 17323.0
3114 203114 0 28.0 Self Enquiry 1 11.0 Salaried Female 4 4.0 Basic 3.0 Single 3.0 0 2 1 2.0 Executive 20996.0
1619 201619 0 19.0 Self Enquiry 1 9.0 Small Business Female 3 3.0 Basic 4.0 Single 2.0 0 3 1 0.0 Executive 16483.0
  • Looks like all the features came in.
  • CustomerID appears to be unique.
  • There are numerical and text columns.
  • No missing values shown in this sample section.
In [237]:
# check the shape
print(f"There are {data.shape[0]} rows and {data.shape[1]} columns in the data.")
There are 4888 rows and 20 columns in the data.
In [238]:
# check that the ID column is unique
data.CustomerID.nunique()
Out[238]:
4888
  • Since all the values in the customer ID are unique, we can drop this column.
In [239]:
# copy the data and drop the unique identifier column
df = data.copy()
df = df.drop("CustomerID", axis=1)
In [240]:
# check datatypes of the columns
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4888 entries, 0 to 4887
Data columns (total 19 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   ProdTaken                 4888 non-null   int64  
 1   Age                       4662 non-null   float64
 2   TypeofContact             4863 non-null   object 
 3   CityTier                  4888 non-null   int64  
 4   DurationOfPitch           4637 non-null   float64
 5   Occupation                4888 non-null   object 
 6   Gender                    4888 non-null   object 
 7   NumberOfPersonVisiting    4888 non-null   int64  
 8   NumberOfFollowups         4843 non-null   float64
 9   ProductPitched            4888 non-null   object 
 10  PreferredPropertyStar     4862 non-null   float64
 11  MaritalStatus             4888 non-null   object 
 12  NumberOfTrips             4748 non-null   float64
 13  Passport                  4888 non-null   int64  
 14  PitchSatisfactionScore    4888 non-null   int64  
 15  OwnCar                    4888 non-null   int64  
 16  NumberOfChildrenVisiting  4822 non-null   float64
 17  Designation               4888 non-null   object 
 18  MonthlyIncome             4655 non-null   float64
dtypes: float64(7), int64(6), object(6)
memory usage: 725.7+ KB
In [241]:
# check which columns have null values
df.isna().sum()[df.isna().sum() > 0]
Out[241]:
Age                         226
TypeofContact                25
DurationOfPitch             251
NumberOfFollowups            45
PreferredPropertyStar        26
NumberOfTrips               140
NumberOfChildrenVisiting     66
MonthlyIncome               233
dtype: int64
  • There are null values for 8 columns. We will have to treat these missing values.
  • The dependent variable, ProdTaken, is of integer type.
  • All variables are of float, integer, or object type.
  • Some columns are of float datatype but are integer-valued in nature; they were read as float because they contain missing values.
  • We will want to convert categorical variables to category datatype.
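As a sketch of that conversion (on a hypothetical two-row frame, not the notebook's data), the object columns can be cast to the category dtype in one step:

```python
import pandas as pd

# hypothetical frame mirroring a few of the notebook's columns
toy = pd.DataFrame(
    {
        "TypeofContact": ["Self Enquiry", "Company Invited"],
        "Gender": ["Male", "Female"],
        "Age": [34.0, 29.0],
    }
)

# cast every object column to the memory-efficient category dtype
obj_cols = toy.select_dtypes(include="object").columns
toy[obj_cols] = toy[obj_cols].astype("category")
```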
In [242]:
# check the unique values for the categorical variables
# will treat CityTier as categorical although it is an ordinal variable, to see its importance in the model
cat_cols = list(df.select_dtypes(include="object").columns) + [
    "ProdTaken",
    "CityTier",
    "Passport",
    "OwnCar",
]
for i in cat_cols:
    print(df[i].value_counts(normalize=True))
    print("-" * 50)
Self Enquiry       0.708205
Company Invited    0.291795
Name: TypeofContact, dtype: float64
--------------------------------------------------
Salaried          0.484452
Small Business    0.426350
Large Business    0.088789
Free Lancer       0.000409
Name: Occupation, dtype: float64
--------------------------------------------------
Male       0.596563
Female     0.371727
Fe Male    0.031710
Name: Gender, dtype: float64
--------------------------------------------------
Basic           0.376841
Deluxe          0.354337
Standard        0.151800
Super Deluxe    0.069967
King            0.047054
Name: ProductPitched, dtype: float64
--------------------------------------------------
Married      0.478723
Divorced     0.194354
Single       0.187398
Unmarried    0.139525
Name: MaritalStatus, dtype: float64
--------------------------------------------------
Executive         0.376841
Manager           0.354337
Senior Manager    0.151800
AVP               0.069967
VP                0.047054
Name: Designation, dtype: float64
--------------------------------------------------
0    0.811784
1    0.188216
Name: ProdTaken, dtype: float64
--------------------------------------------------
1    0.652619
3    0.306874
2    0.040507
Name: CityTier, dtype: float64
--------------------------------------------------
0    0.709083
1    0.290917
Name: Passport, dtype: float64
--------------------------------------------------
1    0.620295
0    0.379705
Name: OwnCar, dtype: float64
--------------------------------------------------
  • The Gender column has a Fe Male category. This needs to be replaced with Female.
  • All other columns take on an expected range of values.
In [243]:
# replace "Fe Male" with "Female"
df.Gender.replace("Fe Male", "Female", inplace=True)
df.Gender.value_counts()
Out[243]:
Male      2916
Female    1972
Name: Gender, dtype: int64
In [244]:
# look at the statistical summary of the data
df.describe(include="all").T
Out[244]:
count unique top freq mean std min 25% 50% 75% max
ProdTaken 4888.0 NaN NaN NaN 0.188216 0.390925 0.0 0.0 0.0 0.0 1.0
Age 4662.0 NaN NaN NaN 37.622265 9.316387 18.0 31.0 36.0 44.0 61.0
TypeofContact 4863 2 Self Enquiry 3444 NaN NaN NaN NaN NaN NaN NaN
CityTier 4888.0 NaN NaN NaN 1.654255 0.916583 1.0 1.0 1.0 3.0 3.0
DurationOfPitch 4637.0 NaN NaN NaN 15.490835 8.519643 5.0 9.0 13.0 20.0 127.0
Occupation 4888 4 Salaried 2368 NaN NaN NaN NaN NaN NaN NaN
Gender 4888 2 Male 2916 NaN NaN NaN NaN NaN NaN NaN
NumberOfPersonVisiting 4888.0 NaN NaN NaN 2.905074 0.724891 1.0 2.0 3.0 3.0 5.0
NumberOfFollowups 4843.0 NaN NaN NaN 3.708445 1.002509 1.0 3.0 4.0 4.0 6.0
ProductPitched 4888 5 Basic 1842 NaN NaN NaN NaN NaN NaN NaN
PreferredPropertyStar 4862.0 NaN NaN NaN 3.581037 0.798009 3.0 3.0 3.0 4.0 5.0
MaritalStatus 4888 4 Married 2340 NaN NaN NaN NaN NaN NaN NaN
NumberOfTrips 4748.0 NaN NaN NaN 3.236521 1.849019 1.0 2.0 3.0 4.0 22.0
Passport 4888.0 NaN NaN NaN 0.290917 0.454232 0.0 0.0 0.0 1.0 1.0
PitchSatisfactionScore 4888.0 NaN NaN NaN 3.078151 1.365792 1.0 2.0 3.0 4.0 5.0
OwnCar 4888.0 NaN NaN NaN 0.620295 0.485363 0.0 0.0 1.0 1.0 1.0
NumberOfChildrenVisiting 4822.0 NaN NaN NaN 1.187267 0.857861 0.0 1.0 1.0 2.0 3.0
Designation 4888 5 Executive 1842 NaN NaN NaN NaN NaN NaN NaN
MonthlyIncome 4655.0 NaN NaN NaN 23619.853491 5380.698361 1000.0 20346.0 22347.0 25571.0 98678.0

Significant observations:

  • Most customers did not purchase a package since the 75% value of ProdTaken is 0.
  • Age appears symmetrically distributed since the mean and median are close.
  • Most customers were contacted via Self Enquiry, are salaried, male, married, were pitched the Basic package, and hold an Executive designation.
  • Most customers are from Tier 1 cities.
  • DurationOfPitch appears to be right skewed, with an extreme value of 127. We will have to look at this more.
  • MonthlyIncome takes on a wide range of values, from 1,000 to about 99,000. There may be outliers.
  • PitchSatisfactionScore appears to be uniformly distributed, as it is spread evenly from min to max.
  • NumberOfTrips appears to have outliers, as the values jump from 4 at the 75th percentile to a max of 22.
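That NumberOfTrips jump can be quantified with the median ± 4×IQR rule used in the outlier-treatment step later in the notebook; a minimal sketch on toy values:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 22])  # toy trip counts with one extreme value
q1, q3 = s.quantile([0.25, 0.75])
iqr = q3 - q1  # interquartile range
# flag points farther than 4 * IQR from the median
outliers = s[(s - s.median()).abs() > 4 * iqr]
```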

Part 2: Exploratory Data Analysis and Data Preprocessing¶

Exploratory data analysis and data preprocessing (missing-value and outlier detection and treatment) often depend on each other, so they are presented together below.
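For the missing-value side, one common option (a sketch on a toy series; the treatment applied to the notebook's data may differ) is median imputation:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"Age": [34.0, np.nan, 29.0, 41.0]})
# fill the missing entry with the column median (median of 29, 34, 41 is 34)
toy["Age"] = toy["Age"].fillna(toy["Age"].median())
```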

Univariate Analysis¶

In [245]:
# function to plot a boxplot and a histogram along the same scale.


def histogram_boxplot(data, feature, figsize=(12, 7), kde=True, bins=None):
    """
    Boxplot and histogram combined

    data: dataframe
    feature: dataframe column
    figsize: size of figure (default (12,7))
    kde: whether to show the density curve (default True)
    bins: number of bins for histogram (default None)
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,  # Number of rows of the subplot grid= 2
        sharex=True,  # x-axis will be shared among all subplots
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=figsize,
    )  # creating the 2 subplots
    sns.boxplot(
        data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
    )  # boxplot will be created and a star will indicate the mean value of the column
    if bins:
        sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins)
    else:
        sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2)  # For histogram
    ax_hist2.axvline(
        data[feature].mean(), color="green", linestyle="--"
    )  # Add mean to the histogram
    ax_hist2.axvline(
        data[feature].median(), color="black", linestyle="-"
    )  # Add median to the histogram
In [246]:
# function to create labeled barplots


def labeled_barplot(data, feature, perc=False, n=None):
    """
    Barplot with percentage at the top

    data: dataframe
    feature: dataframe column
    perc: whether to display percentages instead of count (default is False)
    n: displays the top n category levels (default is None, i.e., display all levels)
    """

    total = len(data[feature])  # length of the column
    count = data[feature].nunique()
    if n is None:
        plt.figure(figsize=(count + 1, 5))
    else:
        plt.figure(figsize=(n + 1, 5))

    plt.xticks(rotation=90, fontsize=15)
    ax = sns.countplot(
        data=data,
        x=feature,
        palette="Paired",
        order=data[feature].value_counts().index[:n],
    )

    for p in ax.patches:
        if perc == True:
            label = "{:.1f}%".format(
                100 * p.get_height() / total
            )  # percentage of each class of the category
        else:
            label = p.get_height()  # count of each level of the category

        x = p.get_x() + p.get_width() / 2  # width of the plot
        y = p.get_height()  # height of the plot

        ax.annotate(
            label,
            (x, y),
            ha="center",
            va="center",
            size=12,
            xytext=(0, 5),
            textcoords="offset points",
        )  # annotate the percentage

    plt.show()  # show the plot
In [247]:
## plot histogram and boxplot for the numerical features
num_cols = [
    "Age",
    "DurationOfPitch",
    "MonthlyIncome",
    "NumberOfPersonVisiting",
    "PitchSatisfactionScore",
    "NumberOfFollowups",
    "PreferredPropertyStar",
    "NumberOfTrips",
    "NumberOfChildrenVisiting",
]
for i in num_cols:
    print(i)
    histogram_boxplot(df, i)
    plt.show()
    print(
        " ****************************************************************** "
    )  ## To create a separator
Age
 ****************************************************************** 
DurationOfPitch
 ****************************************************************** 
MonthlyIncome
 ****************************************************************** 
NumberOfPersonVisiting
 ****************************************************************** 
PitchSatisfactionScore
 ****************************************************************** 
NumberOfFollowups
 ****************************************************************** 
PreferredPropertyStar
 ****************************************************************** 
NumberOfTrips
 ****************************************************************** 
NumberOfChildrenVisiting
 ****************************************************************** 
  • Age appears almost normally distributed, with no outliers. The mean is slightly higher than the median, indicating a slight right skew.
  • DurationOfPitch has a few outliers to the right causing the distribution to be skewed to the right. We should look more closely at these outliers to see if we need to treat them.
  • MonthlyIncome is right-skewed with a few extreme outliers to the right and left.
  • NumberOfPersonVisiting ranges from 1-5, and has an outlier to the right, but it appears to be a reasonable value.
  • PitchSatisfactionScore is almost normally distributed with no outliers.
  • NumberOfFollowups has outliers to the right and left but these are within a reasonable range of values.
  • PreferredPropertyStar is right skewed, as most customers prefer property with 3 stars.
  • NumberOfTrips is right skewed, and has some extreme outliers to the right.
  • NumberOfChildrenVisiting appears to be almost normally distributed.
In [248]:
## Barplot for the categorical features
for i in cat_cols:
    print(i)
    labeled_barplot(df, i, perc=True)
    plt.show()
    print(
        " ****************************************************************** "
    )  ## To create a separator
TypeofContact
 ****************************************************************** 
Occupation
 ****************************************************************** 
Gender
 ****************************************************************** 
ProductPitched
 ****************************************************************** 
MaritalStatus
 ****************************************************************** 
Designation
 ****************************************************************** 
ProdTaken
 ****************************************************************** 
CityTier
 ****************************************************************** 
Passport
 ****************************************************************** 
OwnCar
 ****************************************************************** 
  • 71% of customers made contact by Self Enquiry.
  • Very few (<1%) of customers are Free Lancer.
  • Most (60%) of customers are Male.
  • The least expensive product (Basic) is the one pitched most often.
  • The largest group (48%) of customers are married, while the other marital statuses have similar shares.
  • Most (38%) of customers hold an Executive designation.
  • 81% of customers did not purchase a package, meaning this is a heavily imbalanced dataset.
  • Most (65%) of customers are from Tier 1 cities, the most developed tier. Very few (4%) are from Tier 2 cities.
  • Most (71%) of customers do not have a passport.
  • Most (62%) of customers own a car.
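Because the classes are imbalanced (~81/19), any train/test split should preserve that ratio; a minimal sketch using `train_test_split` with `stratify` (toy labels, not the notebook's data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

y = np.array([0] * 80 + [1] * 20)  # ~80/20 imbalance, like ProdTaken
X = np.arange(len(y)).reshape(-1, 1)  # dummy feature matrix

# stratify=y keeps the 80/20 class ratio in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=1
)
```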

Outlier Treatment¶

DurationOfPitch, MonthlyIncome, and NumberOfTrips had some extreme outliers based on the univariate distributions

In [249]:
# look at leftmost outliers for MonthlyIncome
df1 = df.copy()
df1.sort_values(by="MonthlyIncome").head()
Out[249]:
ProdTaken Age TypeofContact CityTier DurationOfPitch Occupation Gender NumberOfPersonVisiting NumberOfFollowups ProductPitched PreferredPropertyStar MaritalStatus NumberOfTrips Passport PitchSatisfactionScore OwnCar NumberOfChildrenVisiting Designation MonthlyIncome
142 0 38.0 Self Enquiry 1 9.0 Large Business Female 2 3.0 Deluxe 3.0 Single 4.0 1 5 0 0.0 Manager 1000.0
2586 0 39.0 Self Enquiry 1 10.0 Large Business Female 3 4.0 Deluxe 3.0 Single 5.0 1 5 0 1.0 Manager 4678.0
513 1 20.0 Self Enquiry 1 16.0 Small Business Male 2 3.0 Basic 3.0 Single 2.0 1 5 0 0.0 Executive 16009.0
1983 1 20.0 Self Enquiry 1 16.0 Small Business Male 2 3.0 Basic 3.0 Single 2.0 1 5 1 1.0 Executive 16009.0
2197 0 18.0 Company Invited 1 11.0 Salaried Male 3 3.0 Basic 3.0 Single 2.0 0 1 0 1.0 Executive 16051.0
  • The two lowest-income outliers have Manager designations.
In [250]:
# look at rightmost outliers for MonthlyIncome
df1.sort_values(by="MonthlyIncome", na_position="last", ascending=False).head(5)
Out[250]:
ProdTaken Age TypeofContact CityTier DurationOfPitch Occupation Gender NumberOfPersonVisiting NumberOfFollowups ProductPitched PreferredPropertyStar MaritalStatus NumberOfTrips Passport PitchSatisfactionScore OwnCar NumberOfChildrenVisiting Designation MonthlyIncome
2482 0 37.0 Self Enquiry 1 12.0 Salaried Female 3 5.0 Basic 5.0 Divorced 2.0 1 2 1 1.0 Executive 98678.0
38 0 36.0 Self Enquiry 1 11.0 Salaried Female 2 4.0 Basic NaN Divorced 1.0 1 2 1 0.0 Executive 95000.0
4104 0 53.0 Self Enquiry 1 7.0 Salaried Male 4 5.0 King NaN Married 2.0 0 1 1 3.0 VP 38677.0
2634 0 53.0 Self Enquiry 1 7.0 Salaried Male 4 5.0 King NaN Divorced 2.0 0 2 1 2.0 VP 38677.0
4660 0 42.0 Company Invited 1 14.0 Salaried Female 3 6.0 King NaN Married 3.0 0 4 1 2.0 VP 38651.0
  • The two outliers on the higher end both have Executive designations.
In [251]:
# see relationship between Designation and MonthlyIncome
sns.boxplot(data=df1, x="Designation", y="MonthlyIncome")
plt.show()
  • The outliers are evident when plotting against a customer's position. There is a clear pattern with position and salary, so we will impute the outliers with the median based on designation.
In [252]:
# impute outliers with the median
df1.loc[df1.MonthlyIncome < 15000, "MonthlyIncome"] = df1[df1.Designation == "Manager"][
    "MonthlyIncome"
].median()
df1.loc[df1.MonthlyIncome > 40000, "MonthlyIncome"] = df1[
    df1.Designation == "Executive"
]["MonthlyIncome"].median()
In [253]:
# find the values for DurationofPitch that are greater than 4*IQR from the median
quartiles = np.quantile(
    df1["DurationOfPitch"][df1["DurationOfPitch"].notnull()], [0.25, 0.75]
)
dop_4iqr = 4 * (quartiles[1] - quartiles[0])
outlier_dop = df1.loc[
    np.abs(df1["DurationOfPitch"] - df1["DurationOfPitch"].median()) > dop_4iqr,
    "DurationOfPitch",
]
print(outlier_dop.sort_values(ascending=False).count() / df1.shape[0] * 100, "%")
df1.loc[outlier_dop.sort_values().index]
0.04091653027823241 %
Out[253]:
ProdTaken Age TypeofContact CityTier DurationOfPitch Occupation Gender NumberOfPersonVisiting NumberOfFollowups ProductPitched PreferredPropertyStar MaritalStatus NumberOfTrips Passport PitchSatisfactionScore OwnCar NumberOfChildrenVisiting Designation MonthlyIncome
1434 0 NaN Company Invited 3 126.0 Salaried Male 2 3.0 Basic 3.0 Married 3.0 0 1 1 1.0 Executive 18482.0
3878 0 53.0 Company Invited 3 127.0 Salaried Male 3 4.0 Basic 3.0 Married 4.0 0 1 1 2.0 Executive 22160.0
  • There are only two observations. We will drop these.
In [254]:
# drop outliers in DurationOfPitch
df1 = df1[~df1.index.isin(list(outlier_dop.index))]
In [255]:
# find the values for NumberOfTrips that are greater than 4*IQR from the median
quartiles = np.quantile(
    df1["NumberOfTrips"][df1["NumberOfTrips"].notnull()], [0.25, 0.75]
)
not_4iqr = 4 * (quartiles[1] - quartiles[0])
outlier_not = df1.loc[
    np.abs(df1["NumberOfTrips"] - df1["NumberOfTrips"].median()) > not_4iqr,
    "NumberOfTrips",
]
print(outlier_not.sort_values(ascending=False).count() / df1.shape[0] * 100, "%")
df1.loc[outlier_not.sort_values().index]
0.08186655751125665 %
Out[255]:
ProdTaken Age TypeofContact CityTier DurationOfPitch Occupation Gender NumberOfPersonVisiting NumberOfFollowups ProductPitched PreferredPropertyStar MaritalStatus NumberOfTrips Passport PitchSatisfactionScore OwnCar NumberOfChildrenVisiting Designation MonthlyIncome
385 1 30.0 Company Invited 1 10.0 Large Business Male 2 3.0 Basic 3.0 Single 19.0 1 4 1 1.0 Executive 17285.0
2829 1 31.0 Company Invited 1 11.0 Large Business Male 3 4.0 Basic 3.0 Single 20.0 1 4 1 2.0 Executive 20963.0
816 0 39.0 Company Invited 1 15.0 Salaried Male 3 3.0 Deluxe 4.0 Unmarried 21.0 0 2 1 0.0 Manager 21782.0
3260 0 40.0 Company Invited 1 16.0 Salaried Male 4 4.0 Deluxe 4.0 Unmarried 22.0 0 2 1 1.0 Manager 25460.0
  • We will drop these extreme observations.
In [256]:
# drop outliers in NumberOfTrips
df1 = df1[~df1.index.isin(list(outlier_not.index))]
In [257]:
# view distributions after removal of outliers
cols = [
    "DurationOfPitch",
    "MonthlyIncome",
    "NumberOfTrips",
]
for i in cols:
    print(i)
    histogram_boxplot(df1, i)
    plt.show()
    print(
        " ****************************************************************** "
    )  ## To create a separator
DurationOfPitch
 ****************************************************************** 
MonthlyIncome
 ****************************************************************** 
NumberOfTrips
 ****************************************************************** 
  • Extreme outliers have successfully been modified or removed.

Bivariate Analysis¶

With extreme values removed, we will get a better idea of the relationships between variables.

In [258]:
# correlation plot
plt.figure(figsize=(15, 7))
sns.heatmap(df1.corr(), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral")
plt.show()
  • ProdTaken is most correlated with having a passport or not.
  • Age and MonthlyIncome are moderately positively correlated.
  • NumberOfChildrenVisiting and NumberOfPersonVisiting are positively correlated.
  • There are not many strong correlations in the data.

We will see how the target variable varies depending on other features.

In [259]:
# plot numerical features against each other
sns.pairplot(data=df1, hue="ProdTaken", vars=num_cols)
plt.show()
  • There is some positive correlation between MonthlyIncome and Age.
  • There appears to be a lot of overlap looking at this, so we will break it down more.
In [260]:
### function to plot distributions wrt target
def distribution_plot_wrt_target(data, predictor, target):

    fig, axs = plt.subplots(2, 2, figsize=(12, 10))

    target_uniq = data[target].unique()

    axs[0, 0].set_title("Distribution of target for target=" + str(target_uniq[0]))
    sns.histplot(
        data=data[data[target] == target_uniq[0]],
        x=predictor,
        kde=True,
        ax=axs[0, 0],
        color="teal",
        stat="density",
    )

    axs[0, 1].set_title("Distribution of target for target=" + str(target_uniq[1]))
    sns.histplot(
        data=data[data[target] == target_uniq[1]],
        x=predictor,
        kde=True,
        ax=axs[0, 1],
        color="orange",
        stat="density",
    )

    axs[1, 0].set_title("Boxplot w.r.t target")
    sns.boxplot(data=data, x=target, y=predictor, ax=axs[1, 0], palette="gist_rainbow")

    axs[1, 1].set_title("Boxplot (without outliers) w.r.t target")
    sns.boxplot(
        data=data,
        x=target,
        y=predictor,
        ax=axs[1, 1],
        showfliers=False,
        palette="gist_rainbow",
    )

    plt.tight_layout()
    plt.show()
In [261]:
# plot numerical variables with respect to target
sns.set(font_scale=1)
for i in num_cols:
    distribution_plot_wrt_target(df1, i, "ProdTaken")
    plt.show()
    print("*" * 100)
****************************************************************************************************
****************************************************************************************************
****************************************************************************************************
****************************************************************************************************
****************************************************************************************************
****************************************************************************************************
****************************************************************************************************
****************************************************************************************************
****************************************************************************************************
  • Customers who purchased a product are typically younger than those who didn't (median age of around 34 vs. 37).
  • Customers who purchased a product were pitched to for longer.
  • Surprisingly, customers who purchased a product have lower monthly incomes than those who did not. However, the range of monthly incomes is similar for both groups.
  • The distributions of NumberOfPersonVisiting, PitchSatisfactionScore, NumberOfTrips, NumberOfChildrenVisiting, and NumberOfFollowups are similar for both groups.
  • Customers who purchased the product show a higher density at a preferred property rating of 5.
In [262]:
# examine the distributions of Age and MonthlyIncome by target
sns.swarmplot(data=df1, x="ProdTaken", y="Age")
plt.show()
sns.swarmplot(data=df1, x="ProdTaken", y="MonthlyIncome")
plt.show()
  • The bulk of the data for customers who purchased the product can be seen at the lower incomes (around 22K) and lower age (around 30) compared to the more uniform distribution for customers who did not purchase a package.
In [263]:
# function to plot stacked bar chart


def stacked_barplot(data, predictor, target):
    """
    Print the category counts and plot a stacked bar chart

    data: dataframe
    predictor: independent variable
    target: target variable
    """
    count = data[predictor].nunique()
    sorter = data[target].value_counts().index[-1]
    tab1 = pd.crosstab(data[predictor], data[target], margins=True).sort_values(
        by=sorter, ascending=False
    )
    print(tab1)
    print("-" * 120)
    tab = pd.crosstab(data[predictor], data[target], normalize="index").sort_values(
        by=sorter, ascending=False
    )
    tab.plot(kind="bar", stacked=True, figsize=(count + 5, 6))
    plt.legend(loc="upper left", bbox_to_anchor=(1, 1))
    plt.show()
In [264]:
# plot categorical variables with respect to target
othercols = cat_cols.copy()
othercols.remove("ProdTaken")
for i in othercols:
    print(i)
    stacked_barplot(df1, i, "ProdTaken")
    plt.show()
    print(
        " ****************************************************************** "
    )  ## To create a separator
TypeofContact
ProdTaken           0    1   All
TypeofContact                   
All              3942  915  4857
Self Enquiry     2837  607  3444
Company Invited  1105  308  1413
------------------------------------------------------------------------------------------------------------------------
 ****************************************************************** 
Occupation
ProdTaken          0    1   All
Occupation                     
All             3964  918  4882
Salaried        1950  414  2364
Small Business  1700  384  2084
Large Business   314  118   432
Free Lancer        0    2     2
------------------------------------------------------------------------------------------------------------------------
 ****************************************************************** 
Gender
ProdTaken     0    1   All
Gender                    
All        3964  918  4882
Male       2334  576  2910
Female     1630  342  1972
------------------------------------------------------------------------------------------------------------------------
 ****************************************************************** 
ProductPitched
ProdTaken          0    1   All
ProductPitched                 
All             3964  918  4882
Basic           1288  550  1838
Deluxe          1526  204  1730
Standard         618  124   742
King             210   20   230
Super Deluxe     322   20   342
------------------------------------------------------------------------------------------------------------------------
 ****************************************************************** 
MaritalStatus
ProdTaken         0    1   All
MaritalStatus                 
All            3964  918  4882
Married        2012  326  2338
Single          612  302   914
Unmarried       514  166   680
Divorced        826  124   950
------------------------------------------------------------------------------------------------------------------------
 ****************************************************************** 
Designation
ProdTaken          0    1   All
Designation                    
All             3964  918  4882
Executive       1288  550  1838
Manager         1526  204  1730
Senior Manager   618  124   742
AVP              322   20   342
VP               210   20   230
------------------------------------------------------------------------------------------------------------------------
 ****************************************************************** 
CityTier
ProdTaken     0    1   All
CityTier                  
All        3964  918  4882
1          2668  518  3186
3          1144  354  1498
2           152   46   198
------------------------------------------------------------------------------------------------------------------------
 ****************************************************************** 
Passport
ProdTaken     0    1   All
Passport                  
All        3964  918  4882
1           928  492  1420
0          3036  426  3462
------------------------------------------------------------------------------------------------------------------------
 ****************************************************************** 
OwnCar
ProdTaken     0    1   All
OwnCar                    
All        3964  918  4882
1          2468  558  3026
0          1496  360  1856
------------------------------------------------------------------------------------------------------------------------
 ****************************************************************** 
  • Customers invited by the company are somewhat more likely to purchase a product, though the difference is not large.
  • All freelancers (only 2 records in the data) bought the product. Customers from large businesses are more likely to buy than customers from small businesses.
  • Gender is not a strong determinant of who buys the product.
  • Customers who were pitched the Basic product were more likely to buy a product.
  • Single customers are more likely to buy a product.
  • Executives are more likely to buy a product, while the two highest positions (VP and AVP) are least likely.
  • Customers from lower-tier cities and customers with a passport are more likely to buy the product.
  • Owning a car does not seem to be a significant factor in buying a product.
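The visual comparisons above can be backed by a quick statistical check. This is not part of the original notebook; a minimal sketch using `scipy.stats.chi2_contingency` on a toy frame standing in for `df1`, testing whether a categorical predictor (e.g. Passport) is independent of `ProdTaken`:

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical mini-sample; in the notebook this would be df1
demo = pd.DataFrame(
    {
        "Passport":  [1, 1, 0, 0, 0, 1, 0, 0, 1, 0],
        "ProdTaken": [1, 0, 0, 0, 1, 1, 0, 0, 1, 0],
    }
)

# Chi-square test of independence on the 2x2 contingency table
contingency = pd.crosstab(demo["Passport"], demo["ProdTaken"])
chi2, p_value, dof, expected = chi2_contingency(contingency)
print(f"chi2={chi2:.3f}, p={p_value:.3f}, dof={dof}")
```

A small p-value would suggest the predictor and the target are associated, consistent with the stacked bar charts.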
In [265]:
# see how TypeofContact varies with other variables as it did not have a significant difference with respect to the target
plt.figure(figsize=(20, 20))
for i, variable in enumerate(num_cols):
    plt.subplot(5, 2, i + 1)
    sns.boxplot(
        data=df1, x="TypeofContact", y=variable, hue="ProdTaken", palette="PuBu"
    )
    plt.tight_layout()
    plt.legend(loc="upper left", bbox_to_anchor=(1, 1))
    plt.title(variable)
plt.show()
  • The most significant difference is in PitchSatisfactionScore: customers who were contacted via Company Invited and took the product gave a higher PitchSatisfactionScore than those who made a Self Enquiry.
In [266]:
# see how Gender varies with other variables as it did not have a significant difference with respect to the target
plt.figure(figsize=(20, 20))
for i, variable in enumerate(num_cols):
    plt.subplot(5, 2, i + 1)
    sns.boxplot(data=df1, x="Gender", y=variable, hue="ProdTaken", palette="PuBu")
    plt.tight_layout()
    plt.legend(loc="upper left", bbox_to_anchor=(1, 1))
    plt.title(variable)
plt.show()
  • Gender does not appear to make a significant difference in purchasing a product, even with respect to other variables.
In [267]:
# see how OwnCar varies with other variables as those variables did not have a significant difference with respect to the target
plt.figure(figsize=(20, 20))
for i, variable in enumerate(num_cols):
    plt.subplot(5, 2, i + 1)
    sns.boxplot(data=df1, x="OwnCar", y=variable, hue="ProdTaken", palette="PuBu")
    plt.tight_layout()
    plt.legend(loc="upper left", bbox_to_anchor=(1, 1))
    plt.title(variable)
plt.show()
  • There does not appear to be a strong relationship between buying a product and owning a car, even with respect to other variables.

Customer Profiles¶

Create a profile of the customers (for example demographic information) who purchased a package. The profile has to be created for each of the 5 packages.

In [268]:
# filter for customers who purchased a package
df_cust = df1[df1.ProdTaken == 1].copy()  # copy to avoid SettingWithCopyWarning when adding columns later
order = ["Basic", "Standard", "Deluxe", "Super Deluxe", "King"]
In [269]:
# distribution of packages bought
sns.set(font_scale=1)
sns.countplot(data=df_cust, x="ProductPitched", order=order)
Out[269]:
<AxesSubplot:xlabel='ProductPitched', ylabel='count'>
  • Basic is the most popular package bought, followed by the mid-tier option (Deluxe).
  • The most expensive packages (Super Deluxe and King) are the least purchased.
In [270]:
# plot package type vs numerical variables
plt.figure(figsize=(25, 30))
sns.set(font_scale=2)
for i, name in enumerate(num_cols):
    plt.subplot(5, 2, i + 1)
    sns.boxplot(data=df_cust, y=name, x="ProductPitched", order=order)
    plt.tight_layout()
    plt.title(name)
  • There are some clear distinctions with MonthlyIncome, Age, DurationOfPitch, PitchSatisfactionScore, NumberOfFollowups vs. ProductPitched.
In [271]:
# plot statistics by Product Pitched
for i in order:
    print("Statistics for ", i)
    df_sub = df_cust[df_cust.ProductPitched == i]
    display(df_sub.describe().T)
    print("*" * 50)
Statistics for  Basic
count mean std min 25% 50% 75% max
ProdTaken 550.0 1.000000 0.000000 1.0 1.0 1.0 1.0 1.0
Age 513.0 31.292398 9.088340 18.0 25.0 30.0 35.0 59.0
CityTier 550.0 1.512727 0.833509 1.0 1.0 1.0 2.0 3.0
DurationOfPitch 530.0 15.811321 7.915090 6.0 9.0 14.0 22.0 36.0
NumberOfPersonVisiting 550.0 2.907273 0.701639 2.0 2.0 3.0 3.0 4.0
NumberOfFollowups 546.0 3.952381 0.968079 1.0 3.0 4.0 5.0 6.0
PreferredPropertyStar 550.0 3.774545 0.862119 3.0 3.0 3.0 5.0 5.0
NumberOfTrips 545.0 3.166972 1.836019 1.0 2.0 3.0 3.0 8.0
Passport 550.0 0.581818 0.493709 0.0 0.0 1.0 1.0 1.0
PitchSatisfactionScore 550.0 3.210909 1.354702 1.0 2.0 3.0 4.0 5.0
OwnCar 550.0 0.570909 0.495397 0.0 0.0 1.0 1.0 1.0
NumberOfChildrenVisiting 549.0 1.220401 0.867427 0.0 1.0 1.0 2.0 3.0
MonthlyIncome 527.0 20165.466793 3317.026073 16009.0 17546.0 20582.0 21406.5 37868.0
**************************************************
Statistics for  Standard
count mean std min 25% 50% 75% max
ProdTaken 124.0 1.000000 0.000000 1.0 1.00 1.0 1.00 1.0
Age 123.0 41.008130 9.876695 19.0 33.00 38.0 49.00 60.0
CityTier 124.0 2.096774 0.966255 1.0 1.00 3.0 3.00 3.0
DurationOfPitch 123.0 19.065041 9.048811 6.0 11.00 17.0 29.00 36.0
NumberOfPersonVisiting 124.0 2.967742 0.709236 2.0 2.00 3.0 3.00 4.0
NumberOfFollowups 124.0 3.935484 0.908335 1.0 3.00 4.0 4.25 6.0
PreferredPropertyStar 123.0 3.731707 0.878460 3.0 3.00 3.0 5.00 5.0
NumberOfTrips 123.0 3.016260 1.815163 1.0 2.00 2.0 4.00 8.0
Passport 124.0 0.387097 0.489062 0.0 0.00 0.0 1.00 1.0
PitchSatisfactionScore 124.0 3.467742 1.309350 1.0 3.00 3.0 5.00 5.0
OwnCar 124.0 0.661290 0.475191 0.0 0.00 1.0 1.00 1.0
NumberOfChildrenVisiting 123.0 1.121951 0.901596 0.0 0.00 1.0 2.00 3.0
MonthlyIncome 124.0 26035.419355 3593.290353 17372.0 23974.75 25711.0 28628.00 38395.0
**************************************************
Statistics for  Deluxe
count mean std min 25% 50% 75% max
ProdTaken 204.0 1.000000 0.000000 1.0 1.0 1.0 1.0 1.0
Age 198.0 37.641414 8.469575 21.0 32.0 35.5 44.0 59.0
CityTier 204.0 2.411765 0.913532 1.0 1.0 3.0 3.0 3.0
DurationOfPitch 180.0 19.100000 9.227176 6.0 11.0 16.0 28.0 36.0
NumberOfPersonVisiting 204.0 2.950980 0.707141 2.0 2.0 3.0 3.0 4.0
NumberOfFollowups 200.0 3.970000 1.051011 1.0 3.0 4.0 5.0 6.0
PreferredPropertyStar 203.0 3.699507 0.857899 3.0 3.0 3.0 5.0 5.0
NumberOfTrips 202.0 3.702970 2.022483 1.0 2.0 3.0 5.0 8.0
Passport 204.0 0.490196 0.501134 0.0 0.0 0.0 1.0 1.0
PitchSatisfactionScore 204.0 3.039216 1.278250 1.0 2.0 3.0 4.0 5.0
OwnCar 204.0 0.607843 0.489432 0.0 0.0 1.0 1.0 1.0
NumberOfChildrenVisiting 203.0 1.172414 0.841279 0.0 1.0 1.0 2.0 3.0
MonthlyIncome 195.0 23106.215385 3592.466947 17086.0 20744.0 23186.0 24506.0 38525.0
**************************************************
Statistics for  Super Deluxe
count mean std min 25% 50% 75% max
ProdTaken 20.0 1.000000 0.000000 1.0 1.0 1.0 1.00 1.0
Age 20.0 43.500000 4.839530 39.0 40.0 42.0 45.25 56.0
CityTier 20.0 2.600000 0.820783 1.0 3.0 3.0 3.00 3.0
DurationOfPitch 20.0 18.500000 7.330542 8.0 15.0 18.5 20.00 31.0
NumberOfPersonVisiting 20.0 2.700000 0.656947 2.0 2.0 3.0 3.00 4.0
NumberOfFollowups 20.0 3.100000 1.618967 1.0 2.0 3.0 4.00 6.0
PreferredPropertyStar 20.0 3.600000 0.820783 3.0 3.0 3.0 4.00 5.0
NumberOfTrips 19.0 3.263158 2.490919 1.0 1.0 2.0 5.50 8.0
Passport 20.0 0.600000 0.502625 0.0 0.0 1.0 1.00 1.0
PitchSatisfactionScore 20.0 3.800000 1.005249 3.0 3.0 3.0 5.00 5.0
OwnCar 20.0 1.000000 0.000000 1.0 1.0 1.0 1.00 1.0
NumberOfChildrenVisiting 20.0 1.200000 0.833509 0.0 1.0 1.0 2.00 3.0
MonthlyIncome 20.0 29823.800000 3520.426404 21151.0 28129.5 29802.5 31997.25 37502.0
**************************************************
Statistics for  King
count mean std min 25% 50% 75% max
ProdTaken 20.0 1.000000 0.000000 1.0 1.00 1.0 1.0 1.0
Age 20.0 48.900000 9.618513 27.0 42.00 52.5 56.0 59.0
CityTier 20.0 1.800000 1.005249 1.0 1.00 1.0 3.0 3.0
DurationOfPitch 20.0 10.500000 4.135851 8.0 8.00 9.0 9.0 19.0
NumberOfPersonVisiting 20.0 2.900000 0.718185 2.0 2.00 3.0 3.0 4.0
NumberOfFollowups 20.0 4.300000 1.128576 3.0 3.00 4.0 5.0 6.0
PreferredPropertyStar 16.0 3.750000 0.683130 3.0 3.00 4.0 4.0 5.0
NumberOfTrips 17.0 3.411765 1.938389 1.0 2.00 3.0 4.0 7.0
Passport 20.0 0.600000 0.502625 0.0 0.00 1.0 1.0 1.0
PitchSatisfactionScore 20.0 3.300000 1.218282 1.0 3.00 3.0 4.0 5.0
OwnCar 20.0 0.900000 0.307794 0.0 1.00 1.0 1.0 1.0
NumberOfChildrenVisiting 16.0 1.437500 0.892095 0.0 1.00 1.0 2.0 3.0
MonthlyIncome 20.0 34672.100000 5577.603833 17517.0 34470.25 34859.0 38223.0 38537.0
**************************************************
  1. Basic
  • Age spans a wide range from 18-59, with a median of 30.
  • Customers make the lowest median monthly income of about 21K.
  2. Standard
  • Age spans a wide range from 19-60, with a median of 38.
  3. Deluxe
  • Age spans a wide range, with a median of 36.
  4. Super Deluxe
  • Age covers older customers from 39-56, with a median age of 42.
  • Customers have the narrowest range of income, from about 21K to 38K.
  • Customers who bought this package required a lower median number of followups (3) compared to other packages (4).
  5. King
  • Age covers 27-59, with the highest median age of all packages at 53.
  • Customers make the highest median monthly income of about 35K.
  • Customers experienced the shortest median duration of pitch of all packages.
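The per-package statistics above can also be condensed into a single comparison table of the medians that drive the profiles. A minimal sketch (not in the original notebook) on a toy frame standing in for `df_cust`:

```python
import pandas as pd

# Hypothetical mini-frame standing in for df_cust (purchasers only)
demo = pd.DataFrame(
    {
        "ProductPitched": ["Basic", "Basic", "King", "King", "Deluxe"],
        "Age":            [25, 35, 55, 50, 36],
        "MonthlyIncome":  [18000, 21000, 34000, 36000, 23000],
    }
)

# One row per package: median Age and MonthlyIncome side by side
profile = demo.groupby("ProductPitched")[["Age", "MonthlyIncome"]].median()
print(profile)
```

On the real `df_cust`, this single table makes the Basic-to-King gradient in age and income immediately visible without scanning five `describe()` dumps.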
In [272]:
# plot categorical variables with respect to ProductPitched
othercols = cat_cols.copy()
othercols.remove("ProductPitched")
sns.set(font_scale=1)
for i in othercols:
    print(i)
    stacked_barplot(df_cust, i, "ProductPitched")
    plt.show()
    print(
        " ****************************************************************** "
    )  ## To create a separator
TypeofContact
ProductPitched   Basic  Deluxe  King  Standard  Super Deluxe  All
TypeofContact                                                    
All                547     204    20       124            20  915
Company Invited    192      68     0        32            16  308
Self Enquiry       355     136    20        92             4  607
------------------------------------------------------------------------------------------------------------------------
 ****************************************************************** 
Occupation
ProductPitched  Basic  Deluxe  King  Standard  Super Deluxe  All
Occupation                                                      
All               550     204    20       124            20  918
Salaried          260      80     4        54            16  414
Small Business    202     108    12        58             4  384
Free Lancer         2       0     0         0             0    2
Large Business     86      16     4        12             0  118
------------------------------------------------------------------------------------------------------------------------
 ****************************************************************** 
Gender
ProductPitched  Basic  Deluxe  King  Standard  Super Deluxe  All
Gender                                                          
All               550     204    20       124            20  918
Male              342     134     8        76            16  576
Female            208      70    12        48             4  342
------------------------------------------------------------------------------------------------------------------------
 ****************************************************************** 
MaritalStatus
ProductPitched  Basic  Deluxe  King  Standard  Super Deluxe  All
MaritalStatus                                                   
All               550     204    20       124            20  918
Single            228      45     8        11            10  302
Married           188      68     6        56             8  326
Unmarried          74      59     0        31             2  166
Divorced           60      32     6        26             0  124
------------------------------------------------------------------------------------------------------------------------
 ****************************************************************** 
Designation
ProductPitched  Basic  Deluxe  King  Standard  Super Deluxe  All
Designation                                                     
AVP                 0       0     0         0            20   20
All               550     204    20       124            20  918
Executive         550       0     0         0             0  550
Manager             0     204     0         0             0  204
Senior Manager      0       0     0       124             0  124
VP                  0       0    20         0             0   20
------------------------------------------------------------------------------------------------------------------------
 ****************************************************************** 
ProdTaken
ProductPitched  Basic  Deluxe  King  Standard  Super Deluxe  All
ProdTaken                                                       
1                 550     204    20       124            20  918
All               550     204    20       124            20  918
------------------------------------------------------------------------------------------------------------------------
 ****************************************************************** 
CityTier
ProductPitched  Basic  Deluxe  King  Standard  Super Deluxe  All
CityTier                                                        
All               550     204    20       124            20  918
3                 122     144     8        64            16  354
1                 390      60    12        52             4  518
2                  38       0     0         8             0   46
------------------------------------------------------------------------------------------------------------------------
 ****************************************************************** 
Passport
ProductPitched  Basic  Deluxe  King  Standard  Super Deluxe  All
Passport                                                        
All               550     204    20       124            20  918
1                 320     100    12        48            12  492
0                 230     104     8        76             8  426
------------------------------------------------------------------------------------------------------------------------
 ****************************************************************** 
OwnCar
ProductPitched  Basic  Deluxe  King  Standard  Super Deluxe  All
OwnCar                                                          
1                 314     124    18        82            20  558
All               550     204    20       124            20  918
0                 236      80     2        42             0  360
------------------------------------------------------------------------------------------------------------------------
 ****************************************************************** 
  1. Basic
  • Freelancers prefer this package, but more data is needed (only 2 records).
  • Customers are Executives.
  2. Standard
  • More married customers bought this package than customers of any other marital status.
  • Customers are Senior Managers.
  3. Deluxe
  • More customers from small businesses purchased this package than from other occupations.
  • Customers are Managers.
  • Customers are mostly from Tier 3 cities.
  4. Super Deluxe
  • More customers were contacted through Company Invited than Self Enquiry, unlike other packages.
  • Preferred by salaried customers over other occupations.
  • Customers are AVPs.
  • All customers own a car.
  5. King
  • All customers were contacted through Self Enquiry, unlike other packages.
  • More customers from small businesses purchased this package than from other occupations.
  • More female customers bought this package than male customers.
  • Customers are VPs.
In [273]:
# explore how Age varies with gender and marital status for each package
sns.set(font_scale=1)
sns.catplot(data=df_cust, x="MaritalStatus", y="Age", col="ProductPitched", kind="bar")
plt.show()
sns.catplot(data=df_cust, x="Gender", y="Age", col="ProductPitched", kind="bar")
plt.show()
  • Older males prefer the King package.
  • No unmarried customers for King or Super Deluxe packages.
In [274]:
# feature extraction
df_cust["Age_bin"] = pd.cut(
    x=df_cust["Age"],
    bins=[18, 30, 40, 50, 61],
    labels=["18-30", "31-40", "41-50", ">50"],
)

df_cust["Income_bin"] = pd.cut(
    x=df_cust["MonthlyIncome"],
    bins=[15000, 20000, 25000, 30000, 35000, 40000],
    labels=["15K - 20K", "20K - 25K", "25K - 30K", "30K - 35K", ">35K"],
)
In [276]:
# see how the product varies with age and income
sns.countplot(data=df_cust, x="Age_bin", hue="ProductPitched")
plt.show()
sns.countplot(data=df_cust, x="Income_bin", hue="ProductPitched")
plt.show()
  • We can clearly see that King customers are concentrated at higher ages (41+) and higher incomes (30K+).
  • The Basic package is popular among younger customers (18-30) with lower incomes (<25K).
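A note on the binning used above: `pd.cut` produces right-closed intervals by default, so an age of exactly 30 lands in the "18-30" bin. A small self-contained check on toy ages (not the notebook's data):

```python
import pandas as pd

# Hypothetical ages standing in for df_cust["Age"]
ages = pd.Series([22, 28, 33, 45, 58])

# Same bin edges and labels as the feature-extraction cell above;
# intervals are right-closed: (18, 30], (30, 40], (40, 50], (50, 61]
age_bin = pd.cut(
    ages, bins=[18, 30, 40, 50, 61], labels=["18-30", "31-40", "41-50", ">50"]
)
print(age_bin.value_counts().sort_index())
```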

Missing Value Treatment¶

Let's look at the columns that have missing values and see if we can find relationships to guide imputation.

In [199]:
# find number of missing values for each column
df2 = df1.copy()
df2.isna().sum()[df2.isna().sum() > 0]
Out[199]:
Age                         225
TypeofContact                25
DurationOfPitch             251
NumberOfFollowups            45
PreferredPropertyStar        26
NumberOfTrips               140
NumberOfChildrenVisiting     66
MonthlyIncome               233
dtype: int64
In [200]:
# MonthlyIncome most likely varies based on Occupation and Designation
plt.figure(figsize=(15, 10))
sns.set(font_scale=1)
sns.boxplot(data=df2, x="Occupation", y="MonthlyIncome", hue="Designation")
plt.legend(loc="upper left", bbox_to_anchor=(1, 1))
Out[200]:
<matplotlib.legend.Legend at 0x272da5ac940>
  • Median pay is mostly similar across occupation with slight differences, and varies based on Designation.
  • VPs get the highest pay and Executives get paid the least.
In [201]:
# impute missing values in MonthlyIncome based on Occupation and Designation
df2["MonthlyIncome"].fillna(
    value=df2.groupby(["Occupation", "Designation"])["MonthlyIncome"].transform(
        np.median
    ),
    inplace=True,
)
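The group-wise `fillna` pattern used here (and for the remaining columns below) can be illustrated on a toy frame. This sketch mirrors the approach but is not the notebook's data:

```python
import numpy as np
import pandas as pd

# Toy frame: income missing for one Manager; the group-wise transform
# fills it with that group's median rather than the global median
demo = pd.DataFrame(
    {
        "Designation":   ["Manager", "Manager", "Executive", "Manager"],
        "MonthlyIncome": [23000.0, np.nan, 18000.0, 25000.0],
    }
)
demo["MonthlyIncome"] = demo["MonthlyIncome"].fillna(
    demo.groupby("Designation")["MonthlyIncome"].transform("median")
)
print(demo["MonthlyIncome"].tolist())  # [23000.0, 24000.0, 18000.0, 25000.0]
```

`transform` returns a Series aligned to the original index, which is why it can be passed straight to `fillna`: only the missing entries are replaced, each with its own group's statistic.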
In [202]:
# Age varies based on Gender and Designation
plt.figure(figsize=(15, 10))
sns.set(font_scale=1)
sns.boxplot(data=df2, x="Gender", y="Age", hue="Designation")
plt.legend(loc="upper left", bbox_to_anchor=(1, 1))
Out[202]:
<matplotlib.legend.Legend at 0x272d93d2bb0>
  • Age distributions are similar for males and females within each Designation, with some differences at AVP and VP.
In [203]:
# impute missing values in Age based on Gender and Designation
df2["Age"].fillna(
    value=df2.groupby(["Gender", "Designation"])["Age"].transform(np.median),
    inplace=True,
)
In [204]:
# NumberOfChildrenVisiting was moderately correlated with NumberOfPersonVisiting
df2["NumberOfChildrenVisiting"].fillna(
    value=df2.groupby(["NumberOfPersonVisiting"])["NumberOfChildrenVisiting"].transform(
        np.median
    ),
    inplace=True,
)
In [205]:
# DurationOfPitch most likely varies with what product is being pitched
plt.figure(figsize=(15, 8))
sns.boxplot(data=df2, y="DurationOfPitch", x="ProductPitched")
Out[205]:
<AxesSubplot:xlabel='ProductPitched', ylabel='DurationOfPitch'>
  • Duration of Pitch varies by the type of product pitched.
In [206]:
# impute missing values in DurationOfPitch
df2["DurationOfPitch"].fillna(
    value=df2.groupby(["ProductPitched"])["DurationOfPitch"].transform(np.median),
    inplace=True,
)
In [207]:
# relationships with NumberOfTrips across categorical variables
plt.figure(figsize=(25, 30))
sns.set(font_scale=2)
for i, name in enumerate(cat_cols):
    plt.subplot(5, 2, i + 1)
    sns.boxplot(data=df2, x=name, y="NumberOfTrips")
    plt.tight_layout()
    plt.title(name)
  • Median number of trips across different categorical variables is fairly steady around 3.
  • Occupation shows the biggest difference in one category mainly due to the limited data available on customers working as Freelancers.
  • MaritalStatus shows a difference in the median for single vs. non-single customers.
In [208]:
# check the designations of customers with missing NumberOfTrips values
df2[df2.NumberOfTrips.isna()].Designation.value_counts()
Out[208]:
VP                82
AVP               50
Executive          5
Manager            2
Senior Manager     1
Name: Designation, dtype: int64
In [209]:
# impute NumberOfTrips according to MaritalStatus, which showed a difference in medians
df2["NumberOfTrips"].fillna(
    value=df2.groupby(["MaritalStatus"])["NumberOfTrips"].transform(np.median),
    inplace=True,
)
In [210]:
# check how many missing values are left
df2.isna().sum()[df2.isna().sum() > 0]
Out[210]:
TypeofContact            25
NumberOfFollowups        45
PreferredPropertyStar    26
dtype: int64
  • Since there are only a few missing values for the remaining columns, we will impute the missing values with either the mode or median.
In [211]:
# impute the remaining column missing variables with the median or mode
df2["TypeofContact"].fillna("Self Enquiry", inplace=True)  # Self Inquiry is the mode
df2["PreferredPropertyStar"].fillna(
    value=df2["PreferredPropertyStar"].median(), inplace=True
)
df2["NumberOfFollowups"].fillna(value=df2["NumberOfFollowups"].median(), inplace=True)
In [212]:
# confirm that there are no more missing values
df2.isna().sum()[df2.isna().sum() > 0]
Out[212]:
Series([], dtype: int64)
In [213]:
# check correlations after imputations to see if anything has changed
sns.set(font_scale=1)
plt.figure(figsize=(15, 7))
sns.heatmap(df2.corr(), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral")
plt.show()
  • No correlations have increased significantly after imputation of values.

Part 3: Model Building and Evaluation¶

In [286]:
# convert categorical variables to category datatype
df3 = df2.copy()
df3[cat_cols] = df3[cat_cols].astype("category")

We will drop the customer interaction data since we are trying to predict new, potential customers before pitching the package to them.

In [287]:
# create X and Y variables
X = df3.drop(
    [
        "ProdTaken",
        "PitchSatisfactionScore",
        "ProductPitched",
        "NumberOfFollowups",
        "DurationOfPitch",
    ],
    axis=1,
)
y = df3["ProdTaken"]

# get dummy variables
X = pd.get_dummies(X, drop_first=True)

# Splitting data in train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=101, stratify=y
)
In [288]:
# verify that training and test sets have same distribution of 0s and 1s
print("Shape of Training set : ", X_train.shape)
print("Shape of test set : ", X_test.shape)
print("Percentage of classes in training set:")
print(y_train.value_counts(normalize=True))
print("Percentage of classes in test set:")
print(y_test.value_counts(normalize=True))
Shape of Training set :  (3417, 22)
Shape of test set :  (1465, 22)
Percentage of classes in training set:
0    0.811823
1    0.188177
Name: ProdTaken, dtype: float64
Percentage of classes in test set:
0    0.812287
1    0.187713
Name: ProdTaken, dtype: float64
  • Training and test data have same splits in the dependent variable.

Model evaluation criterion¶

The model can make wrong predictions in two ways:¶

  1. Predicting a customer will purchase the package when in reality the customer would not purchase it. - Loss of resources

  2. Predicting a customer will not purchase the package when in reality the customer would have purchased it. - Loss of opportunity and revenue

Which case is more important?¶

  • Predicting a customer who would have purchased the package as someone who would not (a false negative) is more detrimental to the business.

How to reduce this loss?¶

  • Recall should be maximized; the greater the recall, the fewer the false negatives (missed potential buyers).
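To make the recall-first reasoning concrete, here is a small self-contained example (not from the original notebook) showing how false negatives drive recall down while false positives hit precision:

```python
from sklearn.metrics import precision_score, recall_score

# Toy predictions: 4 actual buyers; the model misses 2 of them (FN)
# and flags 1 non-buyer (FP)
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0]

# Recall = TP / (TP + FN): each missed buyer directly lowers recall
print(recall_score(y_true, y_pred))     # 2 / (2 + 2) = 0.5
# Precision = TP / (TP + FP): wasted contacts lower precision instead
print(precision_score(y_true, y_pred))  # 2 / (2 + 1) ≈ 0.667
```

Since a missed buyer (false negative) costs the business revenue while a wasted contact (false positive) only costs outreach resources, recall is the metric the grid searches below optimize.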
In [289]:
# defining a function to compute different metrics to check performance of a classification model built using sklearn
def model_performance_classification_sklearn(model, predictors, target):
    """
    Function to compute different metrics to check classification model performance

    model: classifier
    predictors: independent variables
    target: dependent variable
    """

    # predicting using the independent variables
    pred = model.predict(predictors)

    acc = accuracy_score(target, pred)  # to compute Accuracy
    recall = recall_score(target, pred)  # to compute Recall
    precision = precision_score(target, pred)  # to compute Precision
    f1 = f1_score(target, pred)  # to compute F1-score

    # creating a dataframe of metrics
    df_perf = pd.DataFrame(
        {"Accuracy": acc, "Recall": recall, "Precision": precision, "F1": f1,},
        index=[0],
    )

    return df_perf
In [290]:
def confusion_matrix_sklearn(model, predictors, target):
    """
    To plot the confusion_matrix with percentages

    model: classifier
    predictors: independent variables
    target: dependent variable
    """
    y_pred = model.predict(predictors)
    cm = confusion_matrix(target, y_pred)
    labels = np.asarray(
        [
            ["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
            for item in cm.flatten()
        ]
    ).reshape(2, 2)

    plt.figure(figsize=(6, 4))
    sns.heatmap(cm, annot=labels, fmt="")
    plt.ylabel("True label")
    plt.xlabel("Predicted label")

Part 4: Decision Tree and Bagging Models¶

Decision Tree¶

In [291]:
# Fitting a decision tree model with default parameters
d_tree = DecisionTreeClassifier(random_state=101)
d_tree.fit(X_train, y_train)

# Calculating different metrics
dtree_model_train_perf = model_performance_classification_sklearn(
    d_tree, X_train, y_train
)
print("Training performance:\n", dtree_model_train_perf)
dtree_model_test_perf = model_performance_classification_sklearn(d_tree, X_test, y_test)
print("Testing performance:\n", dtree_model_test_perf)
# Creating confusion matrix
confusion_matrix_sklearn(d_tree, X_test, y_test)
Training performance:
    Accuracy  Recall  Precision   F1
0       1.0     1.0        1.0  1.0
Testing performance:
    Accuracy    Recall  Precision        F1
0  0.874403  0.654545   0.669145  0.661765
  • The decision tree is overfitting the training data as there is a huge difference between training and test scores for all the metrics.
  • The test recall is 65%.

Tuned Decision Tree¶

In [314]:
# Choose the type of classifier.
dtree_tuned = DecisionTreeClassifier(class_weight={0: 0.19, 1: 0.81}, random_state=101)

# Grid of parameters to choose from
parameters = {
    "max_depth": np.arange(2, 20),
    "min_samples_leaf": [2, 5, 7, 10, 15],
    "max_leaf_nodes": [2, 3, 5, 10, 15],
    "min_impurity_decrease": [0.0001, 0.001, 0.01, 0.1],
    "criterion": ["entropy", "gini"],
}

# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)

# Run the grid search
grid_obj = GridSearchCV(dtree_tuned, parameters, scoring=scorer, n_jobs=-1, cv=5)
grid_obj = grid_obj.fit(X_train, y_train)

# Set the clf to the best combination of parameters
dtree_tuned = grid_obj.best_estimator_

# Fit the best algorithm to the data.
dtree_tuned.fit(X_train, y_train)
Out[314]:
DecisionTreeClassifier(class_weight={0: 0.19, 1: 0.81}, max_depth=7,
                       max_leaf_nodes=15, min_impurity_decrease=0.0001,
                       min_samples_leaf=2, random_state=101)
In [315]:
# Calculating different metrics
dtree_tuned_model_train_perf = model_performance_classification_sklearn(
    dtree_tuned, X_train, y_train
)
print("Training performance:\n", dtree_tuned_model_train_perf)
dtree_tuned_model_test_perf = model_performance_classification_sklearn(
    dtree_tuned, X_test, y_test
)
print("Testing performance:\n", dtree_tuned_model_test_perf)
# Creating confusion matrix
confusion_matrix_sklearn(dtree_tuned, X_test, y_test)
Training performance:
    Accuracy    Recall  Precision        F1
0  0.752414  0.720062   0.410097  0.522573
Testing performance:
    Accuracy    Recall  Precision        F1
0   0.74471  0.647273   0.391209  0.487671
  • The decision tree gives a more generalized performance after hyperparameter tuning, with much closer training and test scores.
  • Test recall stayed about the same, while test precision decreased.

Random Forest Classifier¶

In [294]:
# Fitting the model
rf_estimator = RandomForestClassifier(random_state=101)
rf_estimator.fit(X_train, y_train)

# Calculating different metrics
rf_estimator_model_train_perf = model_performance_classification_sklearn(
    rf_estimator, X_train, y_train
)
print("Training performance:\n", rf_estimator_model_train_perf)
rf_estimator_model_test_perf = model_performance_classification_sklearn(
    rf_estimator, X_test, y_test
)
print("Testing performance:\n", rf_estimator_model_test_perf)

# Creating confusion matrix
confusion_matrix_sklearn(rf_estimator, X_test, y_test)
Training performance:
    Accuracy  Recall  Precision   F1
0       1.0     1.0        1.0  1.0
Testing performance:
    Accuracy    Recall  Precision        F1
0  0.881911  0.443636   0.859155  0.585132
  • The random forest is overfitting the training data, as there is a large gap between the training and test scores on every metric.
  • Its test recall is even lower than the decision tree's, though its test precision is higher.

Tuned Random Forest Classifier¶

In [295]:
# Choose the type of classifier.
rf_tuned = RandomForestClassifier(class_weight={0: 0.19, 1: 0.81}, random_state=101)

parameters = {
    "max_depth": list(np.arange(5, 20, 5)) + [None],
    "max_features": ["sqrt", "log2", None],
    "max_samples": [0.3, 0.7, 1.0],
    "n_estimators": np.arange(10, 200, 50),
    "min_samples_leaf": np.arange(2, 10),
}


# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)

# Run the grid search
grid_obj = GridSearchCV(rf_tuned, parameters, scoring=scorer, cv=5, n_jobs=-1)
grid_obj = grid_obj.fit(X_train, y_train)

# Set the clf to the best combination of parameters
rf_tuned = grid_obj.best_estimator_

# Fit the best algorithm to the data.
rf_tuned.fit(X_train, y_train)
Out[295]:
RandomForestClassifier(class_weight={0: 0.19, 1: 0.81}, max_depth=5,
                       max_features='sqrt', max_samples=1.0, min_samples_leaf=6,
                       n_estimators=10, random_state=101)
In [296]:
# Calculating different metrics
rf_tuned_model_train_perf = model_performance_classification_sklearn(
    rf_tuned, X_train, y_train
)
print("Training performance:\n", rf_tuned_model_train_perf)
rf_tuned_model_test_perf = model_performance_classification_sklearn(
    rf_tuned, X_test, y_test
)
print("Testing performance:\n", rf_tuned_model_test_perf)

# Creating confusion matrix
confusion_matrix_sklearn(rf_tuned, X_test, y_test)
Training performance:
    Accuracy    Recall  Precision        F1
0  0.788704  0.690513   0.459152  0.551553
Testing performance:
    Accuracy    Recall  Precision       F1
0  0.787713  0.603636   0.451087  0.51633
  • The test recall has increased significantly after hyperparameter tuning, and the model gives a more generalized performance.
  • The test recall is still not as high as the decision tree's.
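The `class_weight={0: 0.19, 1: 0.81}` setting used for the tuned models appears to mirror the roughly 19% purchase rate, up-weighting the rare positive class. scikit-learn can also derive such weights automatically with its "balanced" mode; a minimal sketch using hypothetical labels with the same imbalance:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical labels with the dataset's ~19% positive rate (illustration only)
y = np.array([1] * 19 + [0] * 81)

# "balanced" reweights each class inversely to its frequency:
# weight = n_samples / (n_classes * class_count)
weights = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y)
print(dict(zip([0, 1], np.round(weights, 3))))  # the rare class gets the larger weight
```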

Bagging Classifier¶

In [297]:
# Fitting the model
bagging_classifier = BaggingClassifier(random_state=101)
bagging_classifier.fit(X_train, y_train)

# Calculating different metrics
bagging_classifier_model_train_perf = model_performance_classification_sklearn(
    bagging_classifier, X_train, y_train
)
print("Training performance:\n", bagging_classifier_model_train_perf)
bagging_classifier_model_test_perf = model_performance_classification_sklearn(
    bagging_classifier, X_test, y_test
)
print("Testing performance:\n", bagging_classifier_model_test_perf)
# Creating confusion matrix
confusion_matrix_sklearn(bagging_classifier, X_test, y_test)
Training performance:
    Accuracy   Recall  Precision        F1
0  0.992391  0.96112   0.998384  0.979398
Testing performance:
    Accuracy    Recall  Precision        F1
0  0.900341  0.578182    0.84127  0.685345
  • The bagging classifier gives a higher test recall than the random forest.
  • However, it also overfits the training data, and its test recall is lower than the decision tree's.
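The structural difference between the two ensembles above: a bagging classifier trains full trees on bootstrap samples and considers every feature at every split, while a random forest additionally restricts each split to a random feature subset. A toy illustration on hypothetical data (not the notebook's):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data; illustrates the structural difference, not the results above
X, y = make_classification(n_samples=200, random_state=101)

# Bagging: unpruned trees on bootstrap samples, all features at every split
bag = BaggingClassifier(
    DecisionTreeClassifier(random_state=101), n_estimators=25, random_state=101
).fit(X, y)

# Random forest: also bootstrapped, but each split sees only a random feature subset
rf = RandomForestClassifier(n_estimators=25, random_state=101).fit(X, y)

# Both memorize the training data, which matches the overfitting seen above
print(bag.score(X, y), rf.score(X, y))
```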

Tuned Bagging Classifier¶

In [298]:
# Choose the type of classifier.
bagging_tuned = BaggingClassifier(
    DecisionTreeClassifier(class_weight={0: 0.19, 1: 0.81}, random_state=101),
    random_state=101,
)

# Grid of parameters to choose from
parameters = {
    "max_samples": [0.7, 0.8, 0.9, 1],
    "max_features": [0.7, 0.8, 0.9, 1],
    "n_estimators": [10, 20, 30, 40, 50],
}

# Type of scoring used to compare parameter combinations
acc_scorer = metrics.make_scorer(metrics.recall_score)

# Run the grid search
grid_obj = GridSearchCV(bagging_tuned, parameters, scoring=acc_scorer, cv=5)
grid_obj = grid_obj.fit(X_train, y_train)

# Set the clf to the best combination of parameters
bagging_tuned = grid_obj.best_estimator_

# Fit the best algorithm to the data.
bagging_tuned.fit(X_train, y_train)
Out[298]:
BaggingClassifier(base_estimator=DecisionTreeClassifier(class_weight={0: 0.19,
                                                                      1: 0.81},
                                                        random_state=101),
                  max_features=0.9, max_samples=0.9, n_estimators=30,
                  random_state=101)
In [299]:
# Calculating different metrics
bagging_tuned_model_train_perf = model_performance_classification_sklearn(
    bagging_tuned, X_train, y_train
)
print("Training performance:\n", bagging_tuned_model_train_perf)
bagging_tuned_model_test_perf = model_performance_classification_sklearn(
    bagging_tuned, X_test, y_test
)
print("Testing performance:\n", bagging_tuned_model_test_perf)

# Creating confusion matrix
confusion_matrix_sklearn(bagging_tuned, X_test, y_test)
Training performance:
    Accuracy    Recall  Precision        F1
0  0.996488  0.982893    0.99842  0.990596
Testing performance:
    Accuracy    Recall  Precision        F1
0  0.887372  0.476364   0.861842  0.613583
  • Surprisingly, the test recall has decreased after hyperparameter tuning and the model is still overfitting the training data.
  • The confusion matrix shows that the model is not good at identifying customers who will purchase the package.

Part 5: Boosting and Stacking Models¶

AdaBoost Classifier¶

In [300]:
# Fitting the model
ab_classifier = AdaBoostClassifier(random_state=101)
ab_classifier.fit(X_train, y_train)

# Calculating different metrics
ab_classifier_model_train_perf = model_performance_classification_sklearn(
    ab_classifier, X_train, y_train
)
print(ab_classifier_model_train_perf)
ab_classifier_model_test_perf = model_performance_classification_sklearn(
    ab_classifier, X_test, y_test
)
print(ab_classifier_model_test_perf)

# Creating confusion matrix
confusion_matrix_sklearn(ab_classifier, X_test, y_test)
   Accuracy    Recall  Precision        F1
0  0.846064  0.293935   0.724138  0.418142
   Accuracy    Recall  Precision        F1
0  0.845051  0.272727   0.735294  0.397878
  • The model generalizes well but performs poorly on recall for both the training and test data.

Tuned AdaBoost Classifier¶

In [301]:
# Choose the type of classifier.
abc_tuned = AdaBoostClassifier(random_state=101)

# Grid of parameters to choose from
parameters = {
    # Let's try different max_depth for base_estimator
    "base_estimator": [
        DecisionTreeClassifier(max_depth=1),
        DecisionTreeClassifier(max_depth=2),
        DecisionTreeClassifier(max_depth=3),
    ],
    "n_estimators": np.arange(10, 110, 20),
    "learning_rate": np.arange(0.1, 2, 0.1),
}

# Type of scoring used to compare parameter  combinations
scorer = metrics.make_scorer(metrics.recall_score)

# Run the grid search
grid_obj = GridSearchCV(abc_tuned, parameters, scoring=scorer, cv=5)
grid_obj = grid_obj.fit(X_train, y_train)

# Set the clf to the best combination of parameters
abc_tuned = grid_obj.best_estimator_

# Fit the best algorithm to the data.
abc_tuned.fit(X_train, y_train)
Out[301]:
AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=3),
                   learning_rate=0.8, n_estimators=90, random_state=101)
In [302]:
# Calculating different metrics
abc_tuned_model_train_perf = model_performance_classification_sklearn(
    abc_tuned, X_train, y_train
)
print(abc_tuned_model_train_perf)
abc_tuned_model_test_perf = model_performance_classification_sklearn(
    abc_tuned, X_test, y_test
)
print(abc_tuned_model_test_perf)

# Creating confusion matrix
confusion_matrix_sklearn(abc_tuned, X_test, y_test)
   Accuracy    Recall  Precision        F1
0  0.965759  0.869362   0.944257  0.905263
   Accuracy  Recall  Precision        F1
0  0.886007    0.56       0.77  0.648421
  • The model's performance has improved, especially on test recall, but it has started to overfit the training data.
  • The decision tree still has the highest test recall.

Gradient Boosting Classifier¶

In [303]:
# Fitting the model
gb_classifier = GradientBoostingClassifier(random_state=101)
gb_classifier.fit(X_train, y_train)

# Calculating different metrics
gb_classifier_model_train_perf = model_performance_classification_sklearn(
    gb_classifier, X_train, y_train
)
print("Training performance:\n", gb_classifier_model_train_perf)
gb_classifier_model_test_perf = model_performance_classification_sklearn(
    gb_classifier, X_test, y_test
)
print("Testing performance:\n", gb_classifier_model_test_perf)

# Creating confusion matrix
confusion_matrix_sklearn(gb_classifier, X_test, y_test)
Training performance:
    Accuracy    Recall  Precision        F1
0  0.875622  0.416796   0.842767  0.557752
Testing performance:
    Accuracy    Recall  Precision        F1
0  0.853925  0.316364   0.769912  0.448454
  • The gradient boosting classifier is overfitting the training data on recall and precision.
  • The model has very poor performance on recall.
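Gradient boosting adds trees sequentially, each one fitted to the errors of the ensemble so far. `staged_predict` exposes this: training accuracy improves as trees accumulate. A small sketch on hypothetical data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Hypothetical toy data (illustration only)
X, y = make_classification(n_samples=300, random_state=101)
gb = GradientBoostingClassifier(n_estimators=50, random_state=101).fit(X, y)

# staged_predict yields the ensemble's predictions after each added tree
scores = [(pred == y).mean() for pred in gb.staged_predict(X)]
print(round(scores[0], 3), round(scores[-1], 3))  # later stages fit the data better
```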

Tuned Gradient Boosting Classifier¶

In [304]:
# Choose the type of classifier.
gbc_tuned = GradientBoostingClassifier(
    init=AdaBoostClassifier(random_state=101), random_state=101
)

# Grid of parameters to choose from
parameters = {
    "n_estimators": [100, 150, 200, 250],
    "subsample": [0.8, 0.9, 1],
    "max_features": [0.7, 0.8, 0.9, 1],
}

# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)

# Run the grid search
grid_obj = GridSearchCV(gbc_tuned, parameters, scoring=scorer, cv=5)
grid_obj = grid_obj.fit(X_train, y_train)

# Set the clf to the best combination of parameters
gbc_tuned = grid_obj.best_estimator_

# Fit the best algorithm to the data.
gbc_tuned.fit(X_train, y_train)
Out[304]:
GradientBoostingClassifier(init=AdaBoostClassifier(random_state=101),
                           max_features=0.7, n_estimators=250, random_state=101,
                           subsample=1)
In [305]:
# Calculating different metrics
gbc_tuned_model_train_perf = model_performance_classification_sklearn(
    gbc_tuned, X_train, y_train
)
print("Training performance:\n", gbc_tuned_model_train_perf)
gbc_tuned_model_test_perf = model_performance_classification_sklearn(
    gbc_tuned, X_test, y_test
)
print("Testing performance:\n", gbc_tuned_model_test_perf)

# Creating confusion matrix
confusion_matrix_sklearn(gbc_tuned, X_test, y_test)
Training performance:
    Accuracy    Recall  Precision       F1
0  0.906351  0.556765   0.910941  0.69112
Testing performance:
    Accuracy  Recall  Precision       F1
0  0.864164     0.4   0.763889  0.52506
  • The test recall has improved after hyperparameter tuning, but the metric is still low.
  • The model overfits on precision.

XGBoost Classifier¶

In [306]:
# Fitting the model
xgb_classifier = XGBClassifier(random_state=101, eval_metric="logloss")
xgb_classifier.fit(X_train, y_train)

# Calculating different metrics
xgb_classifier_model_train_perf = model_performance_classification_sklearn(
    xgb_classifier, X_train, y_train
)
print("Training performance:\n", xgb_classifier_model_train_perf)
xgb_classifier_model_test_perf = model_performance_classification_sklearn(
    xgb_classifier, X_test, y_test
)
print("Testing performance:\n", xgb_classifier_model_test_perf)

# Creating confusion matrix
confusion_matrix_sklearn(xgb_classifier, X_test, y_test)
Training performance:
    Accuracy    Recall  Precision       F1
0  0.997659  0.987558        1.0  0.99374
Testing performance:
    Accuracy  Recall  Precision        F1
0  0.894881    0.56   0.823529  0.666667
  • The XGBoost classifier is overfitting the training data.
  • Let's try hyperparameter tuning to see if the model's performance improves.
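The tuning grid below includes `scale_pos_weight`, XGBoost's built-in counterpart to `class_weight`. A common starting value is the ratio of negative to positive examples; a sketch using hypothetical labels with the dataset's ~19% positive rate:

```python
import numpy as np

# Hypothetical training labels with the dataset's ~19% purchase rate
y_train_example = np.array([1] * 19 + [0] * 81)

counts = np.bincount(y_train_example)
neg, pos = counts[0], counts[1]

# Common heuristic for XGBoost's scale_pos_weight: count(negative) / count(positive)
print(round(neg / pos, 2))  # 4.26, close to the 5 in the grid below
```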

Tuned XGBoost Classifier¶

In [307]:
# Choose the type of classifier.
xgb_tuned = XGBClassifier(random_state=101, eval_metric="logloss")

# Grid of parameters to choose from
parameters = {
    "n_estimators": [10, 30, 50],
    "scale_pos_weight": [1, 2, 5],
    "subsample": [0.7, 0.9, 1],
    "learning_rate": [0.05, 0.1, 0.2],
    "colsample_bytree": [0.7, 0.9, 1],
    "colsample_bylevel": [0.5, 0.7, 1],
}

# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)

# Run the grid search
grid_obj = GridSearchCV(xgb_tuned, parameters, scoring=scorer, cv=5)
grid_obj = grid_obj.fit(X_train, y_train)

# Set the clf to the best combination of parameters
xgb_tuned = grid_obj.best_estimator_

# Fit the best algorithm to the data.
xgb_tuned.fit(X_train, y_train)
Out[307]:
XGBClassifier(base_score=None, booster=None, callbacks=None,
              colsample_bylevel=0.5, colsample_bynode=None,
              colsample_bytree=0.7, early_stopping_rounds=None,
              enable_categorical=False, eval_metric='logloss',
              feature_types=None, gamma=None, gpu_id=None, grow_policy=None,
              importance_type=None, interaction_constraints=None,
              learning_rate=0.05, max_bin=None, max_cat_threshold=None,
              max_cat_to_onehot=None, max_delta_step=None, max_depth=None,
              max_leaves=None, min_child_weight=None, missing=nan,
              monotone_constraints=None, n_estimators=30, n_jobs=None,
              num_parallel_tree=None, predictor=None, random_state=101, ...)
In [308]:
# Calculating different metrics
xgb_tuned_model_train_perf = model_performance_classification_sklearn(
    xgb_tuned, X_train, y_train
)
print("Training performance:\n", xgb_tuned_model_train_perf)
xgb_tuned_model_test_perf = model_performance_classification_sklearn(
    xgb_tuned, X_test, y_test
)
print("Testing performance:\n", xgb_tuned_model_test_perf)

# Creating confusion matrix
confusion_matrix_sklearn(xgb_tuned, X_test, y_test)
Training performance:
    Accuracy    Recall  Precision        F1
0  0.840211  0.835148   0.549642  0.662963
Testing performance:
    Accuracy  Recall  Precision        F1
0  0.817747    0.72   0.510309  0.597285
  • The tuned XGBoost model generalizes well on accuracy but still overfits on recall.
  • The model gives the highest test recall of any model so far.
  • Tuning improved the test recall from 0.56 to 0.72.
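Class weighting is one lever on recall; the classification threshold is another. Lowering the default 0.5 cutoff on `predict_proba` trades precision for recall without retraining. A sketch on hypothetical imbalanced data (not the notebook's features):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score

# Hypothetical imbalanced data standing in for the travel-package features
X, y = make_classification(n_samples=500, weights=[0.81], random_state=101)
clf = LogisticRegression(max_iter=1000).fit(X, y)

proba = clf.predict_proba(X)[:, 1]
for threshold in (0.5, 0.3):
    pred = (proba >= threshold).astype(int)
    # A lower threshold flags more customers as buyers, so recall cannot drop
    print(threshold, round(recall_score(y, pred), 3), round(precision_score(y, pred), 3))
```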

Stacking Model¶

In [316]:
# Combine the models with the more generalized performance and the better recall
estimators = [
    ("Decision Tree", dtree_tuned),
    ("Random Forest", rf_tuned),
    ("Gradient Boosting", gbc_tuned),
]

final_estimator = xgb_tuned

stacking_classifier = StackingClassifier(
    estimators=estimators, final_estimator=final_estimator
)

stacking_classifier.fit(X_train, y_train)
Out[316]:
StackingClassifier(estimators=[('Decision Tree',
                                DecisionTreeClassifier(class_weight={0: 0.19,
                                                                     1: 0.81},
                                                       max_depth=7,
                                                       max_leaf_nodes=15,
                                                       min_impurity_decrease=0.0001,
                                                       min_samples_leaf=2,
                                                       random_state=101)),
                               ('Random Forest',
                                RandomForestClassifier(class_weight={0: 0.19,
                                                                     1: 0.81},
                                                       max_depth=5,
                                                       max_features='sqrt',
                                                       max_samples=1.0,
                                                       min_samples_leaf=6,
                                                       n_estimators=10,
                                                       ran...
                                                 gpu_id=None, grow_policy=None,
                                                 importance_type=None,
                                                 interaction_constraints=None,
                                                 learning_rate=0.05,
                                                 max_bin=None,
                                                 max_cat_threshold=None,
                                                 max_cat_to_onehot=None,
                                                 max_delta_step=None,
                                                 max_depth=None,
                                                 max_leaves=None,
                                                 min_child_weight=None,
                                                 missing=nan,
                                                 monotone_constraints=None,
                                                 n_estimators=30, n_jobs=None,
                                                 num_parallel_tree=None,
                                                 predictor=None,
                                                 random_state=101, ...))
In [317]:
# Calculating different metrics
stacking_classifier_model_train_perf = model_performance_classification_sklearn(
    stacking_classifier, X_train, y_train
)
print("Training performance:\n", stacking_classifier_model_train_perf)
stacking_classifier_model_test_perf = model_performance_classification_sklearn(
    stacking_classifier, X_test, y_test
)
print("Testing performance:\n", stacking_classifier_model_test_perf)

# Creating confusion matrix
confusion_matrix_sklearn(stacking_classifier, X_test, y_test)
Training performance:
    Accuracy    Recall  Precision        F1
0  0.842552  0.808709    0.55615  0.659062
Testing performance:
    Accuracy    Recall  Precision        F1
0  0.831399  0.734545   0.537234  0.620584
  • The stacking classifier performs similarly to the tuned XGBoost model, with slightly less overfitting on accuracy and recall.
  • The confusion matrix shows that the model identifies the majority of customers who will purchase a package, although it is still better at identifying those who will not.
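For context, `StackingClassifier` fits each base estimator on cross-validation folds (cv=5 by default) and trains the final estimator on their out-of-fold predictions, which limits leakage from the base models into the meta-model. A self-contained toy version of the setup above, with hypothetical data and simplified stand-ins for the tuned models:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data (illustration only)
X, y = make_classification(n_samples=300, random_state=101)

stack = StackingClassifier(
    estimators=[
        ("tree", DecisionTreeClassifier(max_depth=3, random_state=101)),
        ("forest", RandomForestClassifier(n_estimators=20, random_state=101)),
    ],
    # Meta-model trained on the base models' out-of-fold predictions
    final_estimator=LogisticRegression(),
    cv=5,
)
stack.fit(X, y)
print(round(stack.score(X, y), 3))
```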

Part 6: Comparing all models¶

In [318]:
# training performance comparison

models_train_comp_df = pd.concat(
    [
        dtree_model_train_perf.T,
        dtree_tuned_model_train_perf.T,
        rf_estimator_model_train_perf.T,
        rf_tuned_model_train_perf.T,
        bagging_classifier_model_train_perf.T,
        bagging_tuned_model_train_perf.T,
        ab_classifier_model_train_perf.T,
        abc_tuned_model_train_perf.T,
        gb_classifier_model_train_perf.T,
        gbc_tuned_model_train_perf.T,
        xgb_classifier_model_train_perf.T,
        xgb_tuned_model_train_perf.T,
        stacking_classifier_model_train_perf.T,
    ],
    axis=1,
)
models_train_comp_df.columns = [
    "Decision Tree",
    "Decision Tree Tuned",
    "Random Forest",
    "Random Forest Tuned",
    "Bagging Classifier",
    "Bagging Classifier Tuned",
    "Adaboost Classifier",
    "Adabosst Classifier Tuned",
    "Gradient Boost Classifier",
    "Gradient Boost Classifier Tuned",
    "XGBoost Classifier",
    "XGBoost Classifier Tuned",
    "Stacking Classifier",
]
print("Training performance comparison:")
models_train_comp_df
Training performance comparison:
Out[318]:
Decision Tree Decision Tree Tuned Random Forest Random Forest Tuned Bagging Classifier Bagging Classifier Tuned Adaboost Classifier Adaboost Classifier Tuned Gradient Boost Classifier Gradient Boost Classifier Tuned XGBoost Classifier XGBoost Classifier Tuned Stacking Classifier
Accuracy 1.0 0.752414 1.0 0.788704 0.992391 0.996488 0.846064 0.965759 0.875622 0.906351 0.997659 0.840211 0.842552
Recall 1.0 0.720062 1.0 0.690513 0.961120 0.982893 0.293935 0.869362 0.416796 0.556765 0.987558 0.835148 0.808709
Precision 1.0 0.410097 1.0 0.459152 0.998384 0.998420 0.724138 0.944257 0.842767 0.910941 1.000000 0.549642 0.556150
F1 1.0 0.522573 1.0 0.551553 0.979398 0.990596 0.418142 0.905263 0.557752 0.691120 0.993740 0.662963 0.659062
In [319]:
# testing performance comparison

models_test_comp_df = pd.concat(
    [
        dtree_model_test_perf.T,
        dtree_tuned_model_test_perf.T,
        rf_estimator_model_test_perf.T,
        rf_tuned_model_test_perf.T,
        bagging_classifier_model_test_perf.T,
        bagging_tuned_model_test_perf.T,
        ab_classifier_model_test_perf.T,
        abc_tuned_model_test_perf.T,
        gb_classifier_model_test_perf.T,
        gbc_tuned_model_test_perf.T,
        xgb_classifier_model_test_perf.T,
        xgb_tuned_model_test_perf.T,
        stacking_classifier_model_test_perf.T,
    ],
    axis=1,
)
models_test_comp_df.columns = [
    "Decision Tree",
    "Decision Tree Tuned",
    "Random Forest",
    "Random Forest Tuned",
    "Bagging Classifier",
    "Bagging Classifier Tuned",
    "Adaboost Classifier",
    "Adabosst Classifier Tuned",
    "Gradient Boost Classifier",
    "Gradient Boost Classifier Tuned",
    "XGBoost Classifier",
    "XGBoost Classifier Tuned",
    "Stacking Classifier",
]
print("Testing performance comparison:")
models_test_comp_df
Testing performance comparison:
Out[319]:
Decision Tree Decision Tree Tuned Random Forest Random Forest Tuned Bagging Classifier Bagging Classifier Tuned Adaboost Classifier Adaboost Classifier Tuned Gradient Boost Classifier Gradient Boost Classifier Tuned XGBoost Classifier XGBoost Classifier Tuned Stacking Classifier
Accuracy 0.874403 0.744710 0.881911 0.787713 0.900341 0.887372 0.845051 0.886007 0.853925 0.864164 0.894881 0.817747 0.831399
Recall 0.654545 0.647273 0.443636 0.603636 0.578182 0.476364 0.272727 0.560000 0.316364 0.400000 0.560000 0.720000 0.734545
Precision 0.669145 0.391209 0.859155 0.451087 0.841270 0.861842 0.735294 0.770000 0.769912 0.763889 0.823529 0.510309 0.537234
F1 0.661765 0.487671 0.585132 0.516330 0.685345 0.613583 0.397878 0.648421 0.448454 0.525060 0.666667 0.597285 0.620584
  • The majority of the models overfit the training data in terms of recall.
  • The tuned XGBoost and stacking classifiers give similar test recall of 0.72-0.73 with generalized performance on accuracy.
  • I would recommend the tuned XGBoost classifier because it is less complex and does not require stacking multiple models.
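Picking the winner by a single metric can be read straight off the comparison table; a sketch using a few recall values copied from the test table above:

```python
import pandas as pd

# A slice of the testing comparison above (values copied from the output table)
recall_by_model = pd.Series(
    {
        "Decision Tree": 0.654545,
        "Random Forest Tuned": 0.603636,
        "XGBoost Classifier Tuned": 0.720000,
        "Stacking Classifier": 0.734545,
    },
    name="Test Recall",
)

# idxmax returns the index label of the largest value
print(recall_by_model.idxmax())  # Stacking Classifier
```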
In [320]:
# feature importances
feature_names = X_train.columns
importances = xgb_tuned.feature_importances_
indices = np.argsort(importances)

plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
  • Having a passport is the most important feature for identifying which customers will purchase a package, followed by holding an Executive designation, being single, and living in a Tier 3 city.
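The built-in `feature_importances_` used above can be cross-checked with permutation importance, which measures how much the score drops when a single column is shuffled. A sketch on toy data where only one feature carries signal (hypothetical, not the notebook's features):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

# Hypothetical toy data: only feature 0 is informative (shuffle=False keeps it first)
X, y = make_classification(
    n_samples=400, n_features=4, n_informative=1, n_redundant=0,
    n_clusters_per_class=1, shuffle=False, random_state=101,
)
model = RandomForestClassifier(n_estimators=50, random_state=101).fit(X, y)

# Shuffle each column 10 times and record the mean drop in accuracy
result = permutation_importance(model, X, y, n_repeats=10, random_state=101)
print(np.argmax(result.importances_mean))  # the informative feature ranks first
```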

Part 7: Conclusions and Recommendations¶

Summary/Insights¶

Data Background:

  • There are 4888 rows and 20 columns in the raw data.
  • 8 columns had null values.
  • All columns were object, float or int type.

Data Preprocessing:

  • The CustomerID column was dropped as it isn’t important to the analysis.
  • Replaced "Fe Male" with "Female" in Gender.
  • DurationOfPitch, MonthlyIncome, and NumberOfTrips had a few extreme outliers that were either imputed or dropped.
  • Missing values were imputed - either with the column median, or based on grouping of other variables.
  • Dummy variables were created for building the models.
  • Customer interaction data was dropped in building the model since this model is to be used before pitching to customers.

Observations from EDA:

  • Most customers make contact by Self Enquiry, are male, are pitched the Basic product, are married, hold an Executive designation, did not purchase a package, are from Tier 1 cities, have a passport, own a car, have a median age of 36, have 1 child visiting, take 3 trips per year, and earn a median monthly income of 22k.
  • The target variable, ProdTaken, is most correlated with having a passport.
  • Customers who bought a package tend to be younger, received a longer pitch, earn a lower median wage, gave higher property ratings, work in Small Business, bought the Basic package, hold an Executive designation, come from lower-tier cities, and have a passport.
  • Basic is the most popular package bought, followed by the mid-tier option (Deluxe).
  • The most expensive packages (Super Deluxe and King) are the least purchased.

Customer Profiles:

  1. Basic: made up of Executives, youngest median age of all packages (30), lowest median income customers (21k), preferred by Freelancers
  2. Standard: made up of senior Managers, median age of 38, more married customers than unmarried
  3. Deluxe: made up of Managers, median age of 36, from Tier 3 cities mostly
  4. Super Deluxe: made up of AVP, older customers from 39-56 with a median age of 42, required a lower median number of followups (3) compared to other packages (4), more contacted through company invite, all customers own a car
  5. King: made up of VP, oldest customers with a median age of 53, make the highest median monthly income of 35k, shortest median duration of pitch compared to all other packages, all contact from Self Enquiry, more female customers, more customers from Small Business

Model Building and Performance:

Models were created to predict whether or not a customer will purchase a package. Recall was the chosen evaluation metric, to minimize false negatives. A decision tree, random forest, bagging classifier, AdaBoost classifier, gradient boosting classifier, and XGBoost classifier were built and tuned with hyperparameters, and a stacking model was built that combined the best individual models. The final model achieved a test recall of 0.72. Based on the chosen XGBoost model, having a passport, holding an Executive designation, being single, and living in a Tier 3 city are the most significant variables for determining whether a customer will purchase a package.

Recommendations¶

  • An ensemble model was successfully built that can be used by the company to target new customers for the new travel package.
  • The company should focus on marketing packages to customers who have a passport, hold an Executive designation, are single, and live in a Tier 3 city.
  • It is recommended that the business gather more data on other features like miles driven or flown on average each year. To market the Wellness package, they should gather health data from the customers.
  • It is recommended to build a separate model with the customer interaction data to learn how significant the duration of the pitch and the number of followups are in convincing a customer to purchase a package.
In [ ]: